Question A: Classifying volcanic rock

In this question, you are allowed to use any method available to you in R, either programmed yourself, or from packages of your choosing. The important thing here is that you communicate your results very well, in terms of language, illustration, and precision.

In this question you will analyse a dataset containing volcanic rocks and their chemical composition, collected from various sites in New Zealand. You will predict whether the Source of a given rock sample is “Okataina” (a geographical location), or whether the sample comes from somewhere else.

Load the annotated dataset rocks.csv and the unlabelled dataset rocks_unlabelled.csv. You will find that rows in the latter set do not contain an entry for Source, i.e., they are of unknown origin. Your goal is to fill in these blanks using the techniques you have learned in this course so far. The following guidelines might be useful to consider:

Data Loading and Preprocessing

# Load required libraries
library(tidyverse)    # For data manipulation and visualization
library(readr)        # For reading CSV files
library(caret)        # For model training and preprocessing
library(skimr)        # For data summaries via skim()
library(knitr)        # For neat table rendering

# Load labeled and unlabeled datasets
rocks <- read_csv("../data/rocks.csv", col_types = cols(...1 = col_skip()))
rocks_unlabelled <- read_csv("../data/rocks_unlabelled.csv", col_types = cols(...1 = col_skip()))


# Inspect the structure of the datasets
# Check for missing values

skim(rocks)
Data summary
Name rocks
Number of rows 330
Number of columns 12
_______________________
Column type frequency:
logical 1
numeric 11
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
okataina 0 1 0.88 TRU: 291, FAL: 39

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SiO2 0 1 75.62 3.24 51.29 75.23 76.34 77.26 78.66 ▁▁▁▁▇
TiO2 0 1 0.21 0.15 0.10 0.12 0.18 0.24 1.08 ▇▁▁▁▁
Al2O3 0 1 13.52 1.17 12.18 12.73 13.24 14.01 19.04 ▇▃▁▁▁
Fe2O3 0 1 1.87 1.31 0.43 1.32 1.56 1.88 11.85 ▇▁▁▁▁
MnO 0 1 0.05 0.02 0.01 0.04 0.05 0.06 0.20 ▇▇▁▁▁
MgO 0 1 0.37 0.60 0.07 0.18 0.24 0.32 5.63 ▇▁▁▁▁
CaO 0 1 1.26 0.81 0.64 0.87 1.04 1.32 8.88 ▇▁▁▁▁
Na2O 0 1 3.64 0.44 1.77 3.36 3.63 3.97 4.60 ▁▁▅▇▃
K2O 0 1 3.44 0.57 0.60 3.18 3.54 3.83 5.15 ▁▁▅▇▁
P2O5 0 1 0.02 0.01 0.00 0.01 0.01 0.02 0.09 ▇▅▁▁▁
LOI 0 1 3.53 0.92 0.19 3.09 3.64 4.15 5.88 ▁▂▆▇▁
skim(rocks_unlabelled)
Data summary
Name rocks_unlabelled
Number of rows 30
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SiO2 2 0.93 75.61 2.01 71.15 74.15 75.97 77.34 77.88 ▂▃▂▇▇
TiO2 2 0.93 0.20 0.08 0.11 0.14 0.17 0.27 0.34 ▇▃▃▂▃
Al2O3 2 0.93 13.79 1.17 12.45 12.75 13.67 14.42 16.63 ▇▇▃▁▂
Fe2O3 2 0.93 1.68 0.43 1.10 1.36 1.61 2.01 2.51 ▇▇▃▆▃
MnO 2 0.93 0.05 0.01 0.04 0.04 0.05 0.05 0.06 ▇▁▆▁▃
MgO 2 0.93 0.30 0.14 0.10 0.22 0.26 0.38 0.60 ▂▇▂▁▂
CaO 2 0.93 1.31 0.53 0.76 0.94 1.08 1.83 2.55 ▇▂▁▂▁
Na2O 2 0.93 3.48 0.29 2.93 3.27 3.54 3.63 4.15 ▂▂▇▁▁
K2O 2 0.93 3.57 0.45 2.28 3.33 3.75 3.89 4.10 ▁▁▃▂▇
P2O5 2 0.93 0.01 0.00 0.01 0.01 0.01 0.01 0.03 ▇▁▁▁▁
LOI 2 0.93 3.77 0.68 1.67 3.34 4.03 4.18 4.73 ▁▁▅▇▆

We imported two datasets:

rocks.csv: labelled training data including the okataina variable.

rocks_unlabelled.csv: data without labels, to be predicted later.

🪾 Tree Method: Random Forest

RF-Tree Training

# Load required packages
library(randomForest)   # For random forest classifier
library(caret)          # For cross-validation and evaluation
library(pROC)           # For AUC calculation
set.seed(62380486)      # For reproducibility

# Recode the logical target variable into valid factor labels
rocks$okataina <- factor(rocks$okataina, levels = c(FALSE, TRUE), labels = c("No", "Yes"))

# Define 10-fold cross-validation control
ctrl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,                   # Needed for AUC
  summaryFunction = twoClassSummary,  # ROC, Sens, Spec
  savePredictions = TRUE
)

# Train Random Forest model over an explicit tuning grid
mtry_grid <- expand.grid(mtry = 2:11)  # candidate values mtry = 2 to 11
tree.rf <- train(
  okataina ~ ., data = rocks,
  method = "rf",
  metric = "ROC",                     # AUC is the performance metric
  trControl = ctrl,
  tuneGrid = mtry_grid                # caret's default grid tries only 3 values; widened to 10 here
)

# Output model summary
print(tree.rf)
## Random Forest 
## 
## 330 samples
##  11 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 297, 297, 297, 297, 297, 296, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens   Spec     
##    2    0.9793103  0.400  1.0000000
##    3    0.9732759  0.450  0.9965517
##    4    0.9750000  0.500  0.9932184
##    5    0.9659770  0.525  0.9897701
##    6    0.9638218  0.500  0.9897701
##    7    0.9634195  0.525  0.9897701
##    8    0.9647126  0.500  0.9897701
##    9    0.9595402  0.525  0.9863218
##   10    0.9565230  0.475  0.9863218
##   11    0.9461782  0.475  0.9863218
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(tree.rf)

The mtry parameter in Random Forest controls the number of predictors randomly selected at each split.

To explore the effect of the number of predictors at each split (mtry), we specified a tuning grid from mtry = 2 to 11. This allowed the model to evaluate all candidate values through 10-fold cross-validation and select the optimal setting based on the highest AUC. The expanded search space offers more detailed insights into model performance and helps avoid suboptimal defaults.

In this case, the optimal mtry is 2, selected from a total of p = 11 features. Randomly selecting m of the p features at each split injects randomness into the tree-growing process. This increases diversity among the trees, reduces their correlation, and enhances the ensemble's ability to reduce variance (Li, lecture 2025).

In this model, mtry = 2 resulted in the highest cross-validated ROC of ~0.98, indicating the best overall discriminative performance. Therefore, it was chosen as the optimal setting.

Confusion Matrix

# Recompute confusion matrix using best cross-validated predictions
# Note: Use predictions corresponding to the best mtry model (mtry = 2)

# Filter predictions to match selected tuning parameters
best_pred <- tree.rf$pred %>%
  filter(mtry == tree.rf$bestTune$mtry)

# Generate confusion matrix
conf_matrix <- confusionMatrix(
  data = best_pred$pred,
  reference = best_pred$obs,
  positive = "Yes"  # "Yes" is the target class (Okataina)
)

# Display confusion matrix
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   15   0
##        Yes  24 291
##                                           
##                Accuracy : 0.9273          
##                  95% CI : (0.8937, 0.9528)
##     No Information Rate : 0.8818          
##     P-Value [Acc > NIR] : 0.004544        
##                                           
##                   Kappa : 0.5243          
##                                           
##  Mcnemar's Test P-Value : 2.668e-06       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.3846          
##          Pos Pred Value : 0.9238          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8818          
##          Detection Rate : 0.8818          
##    Detection Prevalence : 0.9545          
##       Balanced Accuracy : 0.6923          
##                                           
##        'Positive' Class : Yes             
## 

ROC Curve and AUC

# Compute ROC curve
roc_curve <- roc(
  response = best_pred$obs,
  predictor = best_pred$Yes,    # predicted probability for class "Yes"
  levels = c("No", "Yes"),
  direction = "<"
)

# Plot ROC curve
plot(roc_curve,
     col = "blue",
     lwd = 2,
     main = "Random Forest ROC Curve (10-fold CV)")
abline(a = 0, b = 1, lty = 2, col = "gray")  # diagonal reference line

# Report AUC value
auc(roc_curve)
## Area under the curve: 0.9767

Feature Importance Visualisation

# Variable importance from the caret-trained model
varImpPlot(tree.rf$finalModel, main = "Variable Importance - Random Forest")

# Alternatively, get numeric importance values
importance_df <- as.data.frame(varImp(tree.rf)$importance)
importance_df <- importance_df %>%
  rownames_to_column("Feature") %>%
  arrange(desc(Overall))

# Pretty barplot using ggplot2
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Feature Importance in Random Forest Model",
       x = "Feature", y = "Importance (Gini-based)") +
  theme_minimal()

We visualized feature importance using both varImpPlot() from the randomForest package and a custom ggplot2 barplot based on caret::varImp(). While both methods rank features similarly, they differ in the scale of the x-axis.

varImpPlot() reports raw MeanDecreaseGini scores, reflecting the average reduction in node impurity attributable to each variable across the ensemble. In contrast, caret::varImp() returns normalized importance scores on a common scale, which enhances interpretability and aesthetic consistency for visual comparison.

Despite the difference in scale, both plots consistently identify the top predictors, reinforcing their significance in the classification task.

Feature Importance Summary

Random Forest provides an intrinsic measure of feature importance by evaluating the decrease in node impurity (e.g., Gini index) attributable to each feature across all trees. Features that contribute to more effective splits—especially near the root nodes—are ranked higher.

The importance plot above reveals that K2O and CaO are the most influential in distinguishing between Okataina and non-Okataina rocks. These variables likely exhibit substantial distributional differences between the classes or interact strongly with other predictors. In contrast, features with low importance may be either weakly informative or redundant in the presence of others.

Such insights can inform geochemical domain knowledge and guide further feature selection or dimensionality reduction.

Evaluation Metrics Explanation

The confusion matrix summarizes the classification performance of the best cross-validated model. While the model achieves perfect sensitivity, its specificity is considerably low (~0.38), indicating that many non-Okataina samples are misclassified as Okataina. This suggests that the tree.rf model is overly eager to assign the positive class.

In this application we care particularly about specificity, which measures the proportion of correctly identified non-Okataina samples. In the context of geological classification, false positives (i.e., misclassifying a non-Okataina rock as Okataina) may lead to incorrect provenance attribution and flawed geological interpretations. Prioritizing high specificity ensures that when the model predicts a rock as coming from Okataina, it does so with high confidence, minimizing spurious identifications.

The low specificity observed in this model may be a result of class imbalance: the okataina variable is skewed, with approximately 88% ‘Yes’ and only 12% ‘No’. This imbalance could bias the model toward predicting the dominant class, reducing its ability to correctly identify the minority class.

table(rocks$okataina)
## 
##  No Yes 
##  39 291
prop.table(table(rocks$okataina))
## 
##        No       Yes 
## 0.1181818 0.8818182
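One way to mitigate this imbalance, not applied in the analysis above but worth noting, is caret's built-in resampling option. The sketch below (assuming `rocks`, `mtry_grid`, and the libraries loaded earlier) down-samples the majority class within each cross-validation fold via the `sampling = "down"` argument of trainControl:

```r
# Sketch: re-train the random forest with per-fold majority-class down-sampling
ctrl_down <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = TRUE,
  sampling = "down"        # down-sample "Yes" to match "No" inside each resample
)

set.seed(62380486)
tree.rf.down <- train(
  okataina ~ ., data = rocks,
  method = "rf",
  metric = "ROC",
  trControl = ctrl_down,
  tuneGrid = mtry_grid
)
```

Because balancing happens inside each resample, the cross-validated specificity estimates remain honest; the trade-off is usually some loss of sensitivity.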

Non-tree Method Selection

Comparison of Non-Tree Classification Methods

Before selecting an alternative modeling approach to Random Forest, we briefly compare several widely used non-tree classifiers. Each has distinct assumptions and strengths that make them suitable for different types of data.

Method comparison (all four methods are scale-sensitive):

  • kNN — Assumptions: none (non-parametric, distance-based). Strengths: simple and interpretable. Limitations: sensitive to irrelevant features and to feature scaling.

  • LDA/QDA — Assumptions: Gaussian class distributions with equal (LDA) or unequal (QDA) covariance. Strengths: efficient with small data; interpretable decision boundary. Limitations: strong assumptions; sensitive to outliers.

  • SVM — Principle: maximizes the margin between classes. Strengths: effective in high-dimensional spaces; robust to overfitting. Limitations: computationally expensive; less interpretable.

  • Logistic Regression — Assumptions: linear decision boundary; independent features. Strengths: probabilistic output; interpretable coefficients. Limitations: struggles with non-linear boundaries.

Summary Insights

  • k-Nearest Neighbors (kNN) makes no distributional assumptions and is ideal for exploratory use, but it is highly affected by the scale of input features and irrelevant variables.

  • Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) assume normality and are powerful when those assumptions hold, though they are sensitive to violations such as multicollinearity or class imbalance.

  • Support Vector Machines (SVM) are robust and powerful for both linear and non-linear boundaries (via kernels), but their output is less interpretable and requires careful tuning.

  • Logistic Regression is a solid baseline method that provides interpretable coefficients and class probabilities, but performs poorly with non-linear class boundaries.

All the above models require feature standardization, as they rely on distance or assume standardized inputs.
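A toy example (illustrative values only, not assignment data) shows why standardization matters for distance-based methods: a feature on a large scale, such as SiO2 in weight percent, dominates the Euclidean distance over a small-scale feature like P2O5 until both are z-scored.

```r
# Two features on very different scales
x <- data.frame(a = c(75.0, 75.1, 51.3),   # large-scale feature (SiO2-like)
                b = c(0.02, 0.09, 0.02))   # small-scale feature (P2O5-like)

# Raw distances: feature `a` dominates entirely
d_raw <- as.matrix(dist(x))
d_raw[1, 2] < d_raw[1, 3]   # TRUE -- `b` has essentially no influence

# After z-score scaling, both features contribute comparably
x_scaled <- scale(x)
d_scaled <- as.matrix(dist(x_scaled))
round(colMeans(x_scaled), 10)   # means are ~0 after centering
```

This is exactly the effect the `center`/`scale` preprocessing below guards against.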

Method Decision

Given the goals of explainability, minimal distributional assumptions, and straightforward performance evaluation, we apply the kNN method in this case.

kNN Method

Z-score Scaling

# Define the name of the response variable to avoid hardcoding
label_name <- "okataina"

# Create a preprocessing object that centers and scales all predictor variables.
# This step computes the mean and standard deviation for each numeric feature.
# It excludes the response variable ("okataina") from transformation.
preproc_knn <- preProcess(
  rocks[, setdiff(names(rocks), label_name)], 
  method = c("center", "scale")
)

# Apply the preprocessing model to the predictors to standardize them.
# Each feature is rescaled to have mean = 0 and standard deviation = 1.
rocks_knn_scaled <- predict(
  preproc_knn, 
  rocks[, setdiff(names(rocks), label_name)]
)

# Append the original response variable ("okataina") back to the processed data
# so that the full dataset can be used for classification modeling.
rocks_knn_scaled[[label_name]] <- rocks[[label_name]]

All predictor variables were standardized using Z-score normalization, which ensures that each feature has zero mean and unit variance. This step is essential for k-Nearest Neighbors (kNN), which is sensitive to the scale of input variables. The response variable “okataina” was excluded from this transformation and appended back afterward for modeling.

kNN Model Training and Cross-validation Evaluation

# Load required libraries
library(caret)
library(pROC)

# Set seed for reproducibility
set.seed(62380486)

# Define 10-fold cross-validation control
ctrl_knn <- trainControl(
  method = "cv",              # k-fold cross-validation
  number = 10,                # 10 folds
  classProbs = TRUE,          # Needed for ROC and AUC
  summaryFunction = twoClassSummary,  # Evaluation metric: ROC, Sensitivity, Specificity
  savePredictions = TRUE      # Store predictions for later analysis
)

# Train kNN model using caret
knn_model <- train(
  okataina ~ ., data = rocks_knn_scaled,
  method = "knn",
  metric = "ROC",             # Use AUC as selection metric
  trControl = ctrl_knn,
  tuneLength = 10             # Try 10 different k values automatically
)

# Display model summary
print(knn_model)
## k-Nearest Neighbors 
## 
## 330 samples
##  11 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 297, 297, 297, 297, 297, 296, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    5  0.8682615  0.4750000  0.9862069
##    7  0.8737213  0.3666667  0.9896552
##    9  0.8971552  0.2916667  0.9896552
##   11  0.9016523  0.2416667  0.9931034
##   13  0.8957328  0.2166667  1.0000000
##   15  0.8955891  0.2166667  1.0000000
##   17  0.8846839  0.1666667  1.0000000
##   19  0.8740086  0.1000000  1.0000000
##   21  0.8671121  0.1000000  1.0000000
##   23  0.8700000  0.0750000  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
# Plot ROC vs k
plot(knn_model, main = "kNN Cross-validated AUC vs k")

From the model summary above, the best-performing value is k = 11.

Model Performance Evaluation

# Extract predictions from best-tuned model
best_knn_pred <- knn_model$pred %>%
  filter(k == knn_model$bestTune$k)

# Generate confusion matrix using cross-validated predictions
conf_matrix_knn <- confusionMatrix(
  data = best_knn_pred$pred,
  reference = best_knn_pred$obs,
  positive = "Yes"
)

print(conf_matrix_knn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No    9   2
##        Yes  30 289
##                                           
##                Accuracy : 0.903           
##                  95% CI : (0.8659, 0.9327)
##     No Information Rate : 0.8818          
##     P-Value [Acc > NIR] : 0.1325          
##                                           
##                   Kappa : 0.3249          
##                                           
##  Mcnemar's Test P-Value : 1.815e-06       
##                                           
##             Sensitivity : 0.9931          
##             Specificity : 0.2308          
##          Pos Pred Value : 0.9060          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.8818          
##          Detection Rate : 0.8758          
##    Detection Prevalence : 0.9667          
##       Balanced Accuracy : 0.6119          
##                                           
##        'Positive' Class : Yes             
## 

The kNN model was trained using 10-fold cross-validation, with automatic tuning over a range of k values. The optimal number of neighbors, k = 11, was selected based on the highest cross-validated ROC (~0.90).

Evaluation of the best model yielded a sensitivity of 0.99 and a specificity of 0.23, as shown in the confusion matrix.

ROC Curve and AUC

# Compute ROC curve for best model
roc_knn <- roc(
  response = best_knn_pred$obs,
  predictor = best_knn_pred$Yes,    # Probability of class "Yes"
  levels = c("No", "Yes")
)

# Plot ROC curve
plot(roc_knn,
     col = "darkgreen",
     lwd = 2,
     main = "kNN ROC Curve (10-fold Cross-validation)")
abline(a = 0, b = 1, lty = 2, col = "gray")

# Output AUC value
auc(roc_knn)
## Area under the curve: 0.8958

The ROC curve confirms strong discriminative power, with an AUC of 0.8958.

Comparison between RF-tree and kNN

Now, we put key performance values of random forest tree and kNN together.

# Extract best predictions from each model
rf_pred_best <- tree.rf$pred %>%
  filter(mtry == tree.rf$bestTune$mtry)

knn_pred_best <- knn_model$pred %>%
  filter(k == knn_model$bestTune$k)

# Compute confusion matrices
rf_conf <- confusionMatrix(rf_pred_best$pred, rf_pred_best$obs, positive = "Yes")
knn_conf <- confusionMatrix(knn_pred_best$pred, knn_pred_best$obs, positive = "Yes")

# Build comparison table
comparison <- tibble::tibble(
  Model = c("Random Forest", "kNN"),
  ROC = c(
    max(tree.rf$results$ROC),
    max(knn_model$results$ROC)
  ),
  Sensitivity = c(
    rf_conf$byClass["Sensitivity"],
    knn_conf$byClass["Sensitivity"]
  ),
  Specificity = c(
    rf_conf$byClass["Specificity"],
    knn_conf$byClass["Specificity"]
  )
)

# Round and display
comparison %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  kable(caption = "Cross-validated Performance of Random Forest and kNN (based on actual predictions)")
Cross-validated Performance of Random Forest and kNN (based on actual predictions)
Model ROC Sensitivity Specificity
Random Forest 0.979 1.000 0.385
kNN 0.902 0.993 0.231

The table above compares the performance of the Random Forest and k-Nearest Neighbors (kNN) classifiers. The Random Forest model achieved the highest AUC (~0.98), perfect sensitivity (1.00), and higher specificity (0.39), suggesting it performs well both in correctly identifying Okataina samples and in avoiding false positives.

In contrast, the kNN model showed slightly lower sensitivity (0.99) and markedly lower specificity (0.23): it was both slightly less accurate in detecting Okataina samples and more prone to labeling non-Okataina samples as Okataina. In other words, it assigns the Okataina label more freely, which increases the risk of false positives.

Given our earlier emphasis on specificity, Random Forest may be preferable.
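To check whether the AUC difference between the two models is statistically meaningful, DeLong's test from pROC can be applied to the two ROC objects computed earlier (a sketch, assuming `roc_curve` and `roc_knn` from the sections above; note the test treats the two sets of cross-validated predictions as independent samples unless they are explicitly paired over the same observation ordering):

```r
# DeLong's test for the difference between two AUCs
roc.test(roc_curve, roc_knn, method = "delong")
```

A small p-value would support preferring the Random Forest on discriminative performance, not just on the point estimates in the table.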

Discussion

Threshold Chosen by the Youden Index (Improving Specificity)

To address the random forest's low specificity, we can manually adjust the classification threshold to increase our confidence in positive predictions.

By default, classification models assign the “Yes” label (i.e., positive class) when the predicted probability exceeds 0.5. However, when specificity is prioritized—such as in this geological classification task, where false positives may lead to misinterpretation of provenance—it is advisable to raise this threshold. Doing so makes the model more conservative in assigning the “Yes” class, thereby reducing the false positive rate and improving specificity.

This approach represents a post-hoc calibration technique where we trade off sensitivity for higher specificity. We can identify an optimal threshold by evaluating metrics such as the ROC curve, precision-recall tradeoff, or maximizing the Youden index (sensitivity + specificity - 1).
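As an alternative to the manual threshold sweep below, pROC's coords() can locate the Youden-optimal threshold directly (a sketch, assuming the `roc_curve` object from the ROC section above):

```r
# Youden-optimal threshold straight from the ROC object
best_coords <- coords(
  roc_curve,
  x = "best",
  best.method = "youden",
  ret = c("threshold", "sensitivity", "specificity")
)
best_coords
```

The explicit loop below makes the same trade-off visible across the whole threshold range, which is useful for choosing a pragmatic cutoff rather than the exact optimum.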

# Define threshold values to evaluate
thresholds <- seq(0.1, 0.9, by = 0.05)

# Initialize empty results frame
youden_df <- data.frame()

# Loop over thresholds to compute sensitivity, specificity, and Youden index
for (t in thresholds) {
  pred_label <- ifelse(rf_pred_best$Yes > t, "Yes", "No") %>%
    factor(levels = c("No", "Yes"))
  
  cm <- confusionMatrix(pred_label, rf_pred_best$obs, positive = "Yes")
  
  sens <- cm$byClass["Sensitivity"]
  spec <- cm$byClass["Specificity"]
  youden <- sens + spec - 1
  
  youden_df <- rbind(youden_df, data.frame(
    Threshold = t,
    Sensitivity = round(sens, 3),
    Specificity = round(spec, 3),
    YoudenIndex = round(youden, 3)
  ))
}

# Show table
youden_df
# Get the top 5 thresholds with the highest Youden Index
top5_youden <- youden_df %>%
  arrange(desc(YoudenIndex)) %>%
  head(5)

# Show the top 5 thresholds
top5_youden
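The threshold sweep above can also be visualized (a sketch, assuming `youden_df` from the loop above and the tidyverse loaded earlier):

```r
# Plot sensitivity, specificity, and Youden Index against the threshold
library(tidyr)
library(ggplot2)

youden_long <- youden_df %>%
  pivot_longer(c(Sensitivity, Specificity, YoudenIndex),
               names_to = "Metric", values_to = "Value")

ggplot(youden_long, aes(x = Threshold, y = Value, colour = Metric)) +
  geom_line() +
  geom_point(size = 1) +
  labs(title = "Threshold Trade-off for the Random Forest Model",
       x = "Classification threshold", y = "Metric value") +
  theme_minimal()
```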

To identify the optimal decision threshold, we computed sensitivity, specificity, and the Youden Index across a range of thresholds from 0.1 to 0.9. The results show how increasing the threshold improves specificity at the cost of sensitivity.

The Youden Index peaks at thresholds of 0.75 and 0.80, suggesting these provide the best trade-off between true positive and true negative rates. This supports threshold tuning as a post-hoc calibration method when the default cutoff (0.5) does not align with domain-specific priorities.

However, it is not always wise to choose the threshold exactly at the peak of the Youden Index, since real-world data distributions vary. After experimenting with several values, we set our threshold at 0.70.

Confusion Matrix @ Threshold = 0.70

# Manually set threshold = 0.70
threshold <- 0.70

# Classify predictions based on threshold
rf_pred_best$custom_pred <- ifelse(rf_pred_best$Yes > threshold, "Yes", "No") %>%
  factor(levels = c("No", "Yes"))

# Compute confusion matrix at the 0.70 threshold
cm_70 <- confusionMatrix(rf_pred_best$custom_pred, rf_pred_best$obs, positive = "Yes")
print(cm_70)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   31   4
##        Yes   8 287
##                                           
##                Accuracy : 0.9636          
##                  95% CI : (0.9373, 0.9811)
##     No Information Rate : 0.8818          
##     P-Value [Acc > NIR] : 1.239e-07       
##                                           
##                   Kappa : 0.8174          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.9863          
##             Specificity : 0.7949          
##          Pos Pred Value : 0.9729          
##          Neg Pred Value : 0.8857          
##              Prevalence : 0.8818          
##          Detection Rate : 0.8697          
##    Detection Prevalence : 0.8939          
##       Balanced Accuracy : 0.8906          
##                                           
##        'Positive' Class : Yes             
## 

ROC Curve with Threshold Marker

Using a manually selected threshold of 0.70, we recalculated the classification results for the Random Forest model. This threshold yielded a specificity of 0.79 and a sensitivity of 0.99—providing a strong balance while prioritizing low false positive rates.

The ROC curve includes a marker at this threshold, showing the trade-off between sensitivity and specificity. A density plot of predicted probabilities by class further visualizes the separation between “Yes” and “No” samples relative to the threshold.
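A sketch of the annotated plots described above (assuming `roc_curve` and `rf_pred_best` from earlier sections; the marker coordinates are looked up with pROC::coords):

```r
# Locate sensitivity/specificity at the chosen threshold
pt <- coords(roc_curve, x = 0.70, input = "threshold",
             ret = c("specificity", "sensitivity"))

# ROC curve with the 0.70 threshold marked
plot(roc_curve, col = "blue", lwd = 2,
     main = "Random Forest ROC with Threshold = 0.70")
points(pt$specificity, pt$sensitivity, pch = 19, col = "red")
text(pt$specificity, pt$sensitivity, labels = "t = 0.70", pos = 4)

# Density of predicted P(Yes) by true class, with the threshold line
library(ggplot2)
ggplot(rf_pred_best, aes(x = Yes, fill = obs)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 0.70, linetype = "dashed") +
  labs(x = "Predicted probability of 'Yes'", fill = "True class") +
  theme_minimal()
```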

Final Prediction on Unlabeled Data with Random Forest (threshold = 0.70)

# Step 1: Remove rows with missing values from the unlabeled dataset
rocks_unlabelled_clean <- na.omit(rocks_unlabelled)

# Step 2: Predict class probabilities using the trained Random Forest model.
# Note: tree.rf was trained on unscaled predictors, so the new data must be
# supplied on the same raw scale -- the kNN preprocessing model is NOT applied here.
rf_probs <- predict(tree.rf, newdata = rocks_unlabelled_clean, type = "prob")

# Step 3: Apply threshold = 0.70 to assign class labels
rf_pred_class <- ifelse(rf_probs$Yes > 0.70, "Okataina", "Not Okataina")

# Step 4: Combine results with predicted class and probability
rf_pred_result <- rocks_unlabelled_clean %>%
  mutate(
    Predicted_Source = rf_pred_class,
    Probability_Okataina = round(rf_probs$Yes, 3)
  )

# Step 5: View the predictions
rf_pred_result[, c("Predicted_Source", "Probability_Okataina")]

Predicted Class Distribution (Count of “Okataina”)

# Count predicted categories
table(rf_pred_result$Predicted_Source)
## 
## Not Okataina     Okataina 
##            7           21
# Show proportion
prop.table(table(rf_pred_result$Predicted_Source))
## 
## Not Okataina     Okataina 
##         0.25         0.75

Inspecting the predicted probabilities reveals the confidence spread and identifies the samples with the strongest predicted likelihood of being from Okataina. This supports interpretation and the downstream selection of high-confidence samples.

The distribution of predicted classes shows that 21 of the 28 complete unlabelled samples were classified as “Okataina” under the 0.70 threshold.

Limitations of the Random Forest Tree

While the Random Forest (RF) model achieved high overall classification performance (cross-validated AUC ~0.98), several limitations suggest potential concerns regarding its generalizability to new or unseen data:

  1. Overfitting Risk on Imbalanced Data
    The dataset exhibits strong class imbalance, with over 88% of samples labelled as “Okataina”. Although RF can handle imbalance better than single trees, the model may still overfit to the majority class, leading to inflated accuracy and potentially low recall on minority classes.

  2. Low Specificity in Default Settings
    Without manual threshold tuning, the RF model showed near-perfect sensitivity but very low specificity (≈ 0.38). This behavior indicates that the model tends to over-predict the positive class, which could reduce reliability in negative class detection (i.e., identifying non-Okataina rocks).

  3. Interpretability and Decision Transparency
    Despite being more interpretable than neural networks, RF is still an ensemble of many deep trees. Understanding individual decision paths or explaining why a specific rock was classified as “Okataina” remains difficult without post-hoc analysis (e.g., SHAP values).

  4. Sensitivity to Input Noise and Missing Data
    During prediction on the unlabeled dataset, several samples had to be removed due to missing values. RF does not natively handle missing values unless specifically engineered. This can limit robustness in real-world scenarios where geochemical data is often incomplete.

  5. Redundant or Collinear Features
    RF performs no embedded feature selection, and in geochemical datasets some oxides may be highly collinear or irrelevant. Correlated predictors split importance scores between themselves, so the model remains vulnerable to redundancy or noise among the predictors.

While Random Forest is a powerful and flexible classifier, its performance in this case required manual calibration (e.g., threshold tuning) to meet domain-specific goals such as minimizing false positives. For broader deployment or automated applications, additional safeguards—such as feature selection, uncertainty quantification, and external validation—may be necessary to ensure robust generalization.

Question B: Clustering seeds

In this question, you are not allowed to use any pre-implemented modules for performing k-means, expectation maximisation, or other clustering algorithms. You will need to provide your own implementation.

We will work with a dataset seeds.csv containing measurements of a large number of seeds. Each seed is measured in terms of “Area”, “Perimeter”, “MajorAxisLength”, “ConvexArea”, and other optical characteristics of this seed. Your goal in this question will be to find clusters in this dataset, i.e. try and find groups of “different types of seed” (the seeds come from different plants, but this information has been lost, so all we can do is try and estimate how many different types of seeds we have).

  1. Perform data preprocessing, if appropriate.

  2. Investigate the question “how many types of seeds are there”, and present your results briefly, but coherently.

  3. Opportunities for showing extra effort: The additional dataset seeds_annotated.csv contains (few) datapoints which have class labels (A,B,…,G), annotated by an expert of seeds. Use this, in combination with your results from your clustering analysis, to train a seed type prediction algorithm that takes the measurements of a seed and predicts its type. Test your algorithm in a suitable way and communicate your findings (you need to be careful not to check performance purely on the annotated training set). [You do not need to implement your algorithms yourself in this part of the question, feel free to use whatever code works best for you]
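Since pre-implemented clustering routines are not allowed here, a minimal from-scratch k-means (Lloyd's algorithm) in base R can serve as the starting point; the function and variable names below are our own, and the sketch assumes standardized numeric input:

```r
# Minimal k-means (Lloyd's algorithm) in base R.
# X: numeric matrix (rows = observations); k: number of clusters.
my_kmeans <- function(X, k, max_iter = 100, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  # Initialize centroids by sampling k distinct observations
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  cl <- rep(0L, nrow(X))

  for (iter in seq_len(max_iter)) {
    # Assignment step: nearest centroid by squared Euclidean distance
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(X, 2, centroids[j, ])^2)
    })
    new_cl <- max.col(-d2)          # column index of the smallest distance
    if (all(new_cl == cl)) break    # converged: assignments unchanged
    cl <- new_cl

    # Update step: each centroid becomes the mean of its members
    for (j in seq_len(k)) {
      members <- X[cl == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cl, centers = centroids)
}
```

Running this for a range of k and tracking the total within-cluster sum of squares (an elbow plot) is one way to investigate "how many types of seeds there are" in part 2.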

Data Preprocessing

Data Load and Wrangling

# Define the data directory path (relative to the current working directory)
data_dir <- "../data"

# Construct full file paths to the annotated and unlabelled seed datasets
path_annotated <- file.path(data_dir, "seeds_annotated.csv")
path_unlabeled <- file.path(data_dir, "seeds.csv")

# Read the annotated data, assuming decimal commas (",") are used in numeric values
annotated <- read_csv(
  path_annotated,
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE   # Suppress column type messages
)

# Read the unlabelled seed data using the same locale settings
unlabeled <- read_csv(
  path_unlabeled,
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)

annotated <- select(annotated, -...1)  # Remove the unnamed index column ...1
unlabeled <- select(unlabeled, -...1)

Data Inspection

skim(unlabeled)
Data summary
Name unlabeled
Number of rows 13511
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Area 0 1 53032.20 29297.77 20420.00 36323.50 44659.00 61289.00 254616.00 ▇▂▁▁▁
Perimeter 0 1 855.21 214.19 524.74 703.45 794.98 976.54 1985.37 ▇▆▁▁▁
MajorAxisLength 0 1 320.12 85.66 183.60 253.30 296.90 376.42 738.86 ▇▆▂▁▁
MinorAxisLength 0 1 202.23 44.94 122.51 175.79 192.39 217.02 460.20 ▇▇▁▁▁
AspectRation 0 1 1.58 0.25 1.02 1.43 1.55 1.71 2.43 ▂▇▅▂▁
Eccentricity 0 1 0.75 0.09 0.22 0.72 0.76 0.81 0.91 ▁▁▂▇▇
ConvexArea 0 1 53752.32 29748.74 20684.00 36713.00 45200.00 62249.00 263261.00 ▇▂▁▁▁
EquivDiameter 0 1 253.03 59.14 161.24 215.05 238.46 279.35 569.37 ▇▆▁▁▁
Extent 0 1 0.75 0.05 0.56 0.72 0.76 0.79 0.87 ▁▁▅▇▂
Solidity 0 1 0.99 0.00 0.92 0.99 0.99 0.99 0.99 ▁▁▁▁▇
roundness 0 1 0.87 0.06 0.49 0.83 0.88 0.92 0.99 ▁▁▂▇▇
Compactness 0 1 0.80 0.06 0.64 0.76 0.80 0.83 0.99 ▂▅▇▂▁
ShapeFactor1 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.01 ▁▃▇▃▁
ShapeFactor2 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▇▇▇▃▁
ShapeFactor3 0 1 0.64 0.10 0.41 0.58 0.64 0.70 0.97 ▂▇▇▂▁
ShapeFactor4 0 1 1.00 0.00 0.95 0.99 1.00 1.00 1.00 ▁▁▁▁▇

No missing data found in unlabeled.

skim(annotated)
Data summary
Name annotated
Number of rows 100
Number of columns 17
_______________________
Column type frequency:
character 1
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Class 0 1 1 1 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Area 0 1 55220.93 32778.32 25319.00 37335.25 44505.00 63994.50 198671.00 ▇▃▁▁▁
Perimeter 0 1 864.65 228.81 590.34 707.56 792.64 998.67 1686.77 ▇▅▂▁▁
MajorAxisLength 0 1 322.85 91.19 211.57 253.04 295.36 381.25 623.27 ▇▃▅▁▁
MinorAxisLength 0 1 207.27 49.35 134.45 180.20 196.41 217.85 410.10 ▅▇▁▁▁
AspectRation 0 1 1.56 0.24 1.11 1.37 1.57 1.70 2.16 ▅▅▇▂▂
Eccentricity 0 1 0.74 0.10 0.44 0.68 0.77 0.81 0.89 ▁▂▂▇▆
ConvexArea 0 1 55914.22 33217.58 25664.00 37653.00 44921.50 65094.25 201134.00 ▇▃▁▁▁
EquivDiameter 0 1 257.32 64.32 179.55 218.03 238.05 285.44 502.95 ▇▅▁▁▁
Extent 0 1 0.75 0.05 0.60 0.72 0.77 0.79 0.83 ▁▂▅▇▆
Solidity 0 1 0.99 0.00 0.97 0.99 0.99 0.99 0.99 ▁▁▂▇▆
roundness 0 1 0.88 0.05 0.77 0.85 0.88 0.93 0.98 ▃▃▇▅▅
Compactness 0 1 0.81 0.06 0.68 0.76 0.80 0.85 0.95 ▂▆▇▃▃
ShapeFactor1 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.01 ▁▃▇▃▁
ShapeFactor2 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▇▆▆▃▂
ShapeFactor3 0 1 0.65 0.10 0.46 0.58 0.64 0.73 0.90 ▃▇▅▃▂
ShapeFactor4 0 1 1.00 0.00 0.98 0.99 1.00 1.00 1.00 ▁▁▂▅▇

No missing data found in annotated.

Z-score Normalisation

# Select only the numeric columns from the unlabeled dataset
# This is important because scaling only applies to continuous numeric variables
unlabeled_scaled <- select(unlabeled, where(is.numeric)) |>
  
  # Apply standardization (z-score scaling) to each numeric column
  # This transforms each feature to have mean = 0 and standard deviation = 1
  scale() |>
  
  # Convert the result from a matrix back to a tibble for compatibility with tidyverse
  as_tibble()

skim(unlabeled_scaled)
Data summary
Name unlabeled_scaled
Number of rows 13511
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Area 0 1 0 1 -1.11 -0.57 -0.29 0.28 6.88 ▇▂▁▁▁
Perimeter 0 1 0 1 -1.54 -0.71 -0.28 0.57 5.28 ▇▆▁▁▁
MajorAxisLength 0 1 0 1 -1.59 -0.78 -0.27 0.66 4.89 ▇▆▂▁▁
MinorAxisLength 0 1 0 1 -1.77 -0.59 -0.22 0.33 5.74 ▇▇▁▁▁
AspectRation 0 1 0 1 -2.26 -0.61 -0.13 0.50 3.43 ▂▇▅▂▁
Eccentricity 0 1 0 1 -5.79 -0.38 0.15 0.65 1.75 ▁▁▂▇▇
ConvexArea 0 1 0 1 -1.11 -0.57 -0.29 0.29 7.04 ▇▂▁▁▁
EquivDiameter 0 1 0 1 -1.55 -0.64 -0.25 0.44 5.35 ▇▆▁▁▁
Extent 0 1 0 1 -3.96 -0.63 0.21 0.76 2.37 ▁▁▅▇▂
Solidity 0 1 0 1 -14.55 -0.32 0.24 0.62 1.62 ▁▁▁▁▇
roundness 0 1 0 1 -6.44 -0.69 0.17 0.73 1.97 ▁▁▂▇▇
Compactness 0 1 0 1 -2.58 -0.61 0.02 0.56 3.04 ▂▅▇▂▁
ShapeFactor1 0 1 0 1 -3.36 -0.59 0.07 0.63 3.45 ▁▃▇▃▁
ShapeFactor2 0 1 0 1 -1.93 -0.94 -0.04 0.76 3.27 ▇▇▇▃▁
ShapeFactor3 0 1 0 1 -2.36 -0.63 -0.01 0.53 3.35 ▂▇▇▂▁
ShapeFactor4 0 1 0 1 -10.84 -0.31 0.30 0.64 1.07 ▁▁▁▁▇

The reason we do Z-score normalisation is that clustering algorithms such as k-means are distance-based (typically Euclidean distance). If one variable has a much larger scale than others (e.g., 100s vs. 0.1s), it will dominate the distance metric and bias the clustering results.

Z-score standardization ensures that all features contribute equally to the clustering process.
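As a toy illustration (made-up numbers, not the seeds data), `scale()` maps each column to mean 0 and standard deviation 1:

```r
# Two features on very different scales (illustrative values only)
toy <- data.frame(Area = c(20000, 45000, 60000),
                  Extent = c(0.70, 0.75, 0.80))

toy_scaled <- as.data.frame(scale(toy))

colMeans(toy_scaled)          # both columns: mean 0 (up to rounding)
apply(toy_scaled, 2, sd)      # both columns: sd 1
```

After scaling, a one-standard-deviation change in Area and in Extent contribute equally to a Euclidean distance, which is exactly what distance-based clustering needs.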

K‑means from Scratch

K-means Algorithms

# Compute squared distance from each row of X to a single centroid
delta_to_centroid <- function(X, centroid) {
  n <- nrow(X)
  p <- ncol(X)
  rowSums((X - matrix(centroid, n, p, byrow = TRUE))^2)  # ensure conformable dimensions
}

# Compute a distance matrix from all points to all centroids
delta_matrix <- function(X, centroids) {
  k <- nrow(centroids)
  sapply(seq_len(k), function(i) delta_to_centroid(X, centroids[i, ]))
}

# Compute total within-cluster sum of squares using f_kmeans output
compute_wss <- function(k, data, max_iter, start) {
  model <- f_kmeans(data, k, max_iter = max_iter, start = start)
  
  X <- as.matrix(data)                       # ensure matrix format
  clusters <- model$clusters                 # vector of cluster assignments
  centroids <- model$centroids               # matrix of final centroids
  
  total_withinss <- sum(sapply(seq_len(k), function(j) {
    idx <- which(clusters == j)              # indices assigned to cluster j
    if (length(idx) > 0) {
      point_set <- X[idx, , drop = FALSE]    # all points in cluster j
      centroid_row <- matrix(centroids[j, , drop = FALSE],
                             nrow = length(idx),
                             ncol = ncol(X),
                             byrow = TRUE)   # broadcast centroid for subtraction
      sum(rowSums((point_set - centroid_row)^2))  # WCSS for cluster j
    } else {
      0
    }
  }))
  
  return(total_withinss)
}

# Custom K-means implementation
# Note: the fixed seed gives a single deterministic initialisation; the `start`
# argument is accepted for interface compatibility but restarts are not run.
f_kmeans <- function(data, k, max_iter = 100, start = 10) {
  set.seed(82171165)
  X <- as.matrix(data)                  # force to matrix
  n <- nrow(X)
  p <- ncol(X)

  centroids <- X[sample(n), , drop = FALSE][1:k, ]  # random initial centroids
  clusters <- integer(n)              # empty cluster assignment
  iter_reached <- NA

  for (iter in seq_len(max_iter)) {
    distances <- delta_matrix(X, centroids)              # distance matrix
    clusters_new <- max.col(-distances)                  # assign nearest centroid
    has_converged <- !anyNA(clusters_new) && all(clusters_new == clusters)

    if (has_converged) {
      iter_reached <- iter
      break
    }

    clusters <- clusters_new

    # Recompute centroids
    for (j in seq_len(k)) {
      idx <- which(clusters == j)
      if (length(idx) > 0) {
        centroids[j, ] <- colMeans(X[idx, , drop = FALSE])  # new centroid
      }
    }
  }

  # Recompute WCSS at the end with safe dimension handling
  total_withinss <- sum(sapply(seq_len(k), function(j) {
    idx <- which(clusters == j)
    if (length(idx) > 0) {
      point_set <- X[idx, , drop = FALSE]
      centroid_row <- matrix(centroids[j, , drop = FALSE],
                             nrow = length(idx),
                             ncol = ncol(X),
                             byrow = TRUE)
      sum(rowSums((point_set - centroid_row)^2))
    } else {
      0
    }
  }))

  # Return result object
  list(
    clusters = clusters,
    centroids = centroids,
    iter = iter_reached,
    total_withinss = total_withinss
  )
}

Custom Implementation of the K-means Clustering Algorithm:

  • Computes squared Euclidean distances between data points and centroids.

  • Assigns each data point to the nearest cluster.

  • Recomputes centroids as the mean of assigned points.

  • Iterates the above steps until convergence or a maximum number of iterations is reached.
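The assignment and update steps above can be traced by hand on a tiny one-dimensional example (toy numbers, not the seeds data):

```r
# Four points and two initial centroids in 1-D
x <- c(1, 2, 10, 11)
centroids <- c(1, 11)

# Assignment step: squared distance to each centroid, pick the nearest
d <- outer(x, centroids, function(a, b) (a - b)^2)
clusters <- max.col(-d)                              # c(1, 1, 2, 2)

# Update step: each centroid becomes the mean of its assigned points
centroids <- as.numeric(tapply(x, clusters, mean))   # c(1.5, 10.5)
```

A second iteration reassigns every point identically, so on this example the algorithm converges after one update.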

Computation of Within-Cluster Sum of Squares (WSS):

  • For each specified number of clusters k, the algorithm computes the WSS.

  • WSS measures the compactness of clusters — lower values indicate tighter clusters.

Preparation for Determining Optimal Cluster Number:

  • WSS values for a range of k (typically 2 to 10) are stored.

  • These values can be visualised in an elbow plot to help determine the optimal number of clusters using the elbow method.

Test-toggle

Before running on the full dataset (more than 10k rows), we first conduct a quick test.

This code block verifies the fundamental decomposition identity in clustering: \[\text{Total Sum of Squares (TotSS)} = \text{Between-Cluster SS (BSS)} + \text{Within-Cluster SS (WSS)}\]

It runs the custom f_kmeans() implementation for several values of k (from 2 to 10) and, for each k: calculates TotSS directly from the full dataset, computes WSS from the K-means result, derives BSS by subtraction, checks that the identity holds, and collects all values into a structured table. Because BSS is obtained by subtraction, the identity holds by construction; the table chiefly serves as a diagnostic that WSS stays below TotSS and decreases sensibly as k grows.
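The identity itself is easy to confirm on a toy example in which BSS is computed directly rather than by subtraction (illustrative numbers only, not the seeds data):

```r
# Two clusters in 1-D
x <- c(1, 2, 10, 11)
clusters <- c(1, 1, 2, 2)

grand_mean <- mean(x)
totss <- sum((x - grand_mean)^2)                   # 82

cluster_means <- tapply(x, clusters, mean)         # 1.5 and 10.5
wss <- sum((x - cluster_means[clusters])^2)        # 1
n_j <- as.numeric(table(clusters))
bss <- sum(n_j * (cluster_means - grand_mean)^2)   # 81

all.equal(totss, wss + bss)                        # TRUE
```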

# Define a flag to control whether to run a test mode with a subset of data or the full mode. This is useful for quick debugging.
testing <- FALSE 

# Conditionally sets up a test configuration with fewer data rows and smaller k values if testing = TRUE, or sets full run parameters otherwise.
# 
# unlabeled_use is the standardised data passed to the clustering algorithm.
# 
# k_vals defines the range of clusters to evaluate.
# 
# max_iter limits the maximum number of K-means iterations per run.
if (testing) {
  message("🟡 Quick test: 500 rows, k=2:4, iter.max=5")
  unlabeled_use <- unlabeled_scaled[1:500,]; k_vals <- 2:4; max_iter <- 5
} else {
  message("🟢 Full run: all rows, k=2:10, iter.max=100")
  unlabeled_use <- unlabeled_scaled; k_vals <- 2:10; max_iter <- 100
}

start <- 10  # passed through to f_kmeans via compute_wss

# Compute the Within-Cluster Sum of Squares (WSS) for each k using the custom
# compute_wss() function; vapply() is preferred over sapply() for type safety.
wss_values <- vapply(k_vals, function(k) compute_wss(k, unlabeled_use, max_iter, start), numeric(1))

This chunk prepares clustering evaluation by:

  • Running the custom K-means algorithm across a range of k values.

  • Measuring the WSS (compactness) for each.

  • Preparing the results for plotting the elbow curve to determine the optimal number of clusters.

# This chunk verifies the decomposition identity:
# Total Sum of Squares = Within-Cluster SS + Between-Cluster SS

debug_tbl <- purrr::map_dfr(k_vals, function(k) {
  
  # Run custom K-means once; its result already contains the WSS,
  # so there is no need to rerun the clustering inside compute_wss()
  res <- f_kmeans(unlabeled_use, k, max_iter = max_iter)
  tot_within <- res$total_withinss

  # Compute the overall (grand) mean vector of the entire dataset
  grand_mean <- colMeans(unlabeled_use)

  # Compute total sum of squares (TotSS)
  # This measures how far each data point is from the grand mean
  totss <- sum(rowSums(
    (as.matrix(unlabeled_use) -
     matrix(grand_mean,
            nrow = nrow(unlabeled_use),
            ncol = ncol(unlabeled_use),
            byrow = TRUE))^2
  ))

  # Compute between-cluster sum of squares (BetSS)
  # This is the difference between total and within-cluster variation
  betweenss <- totss - tot_within

  # Return a tibble with current k and diagnostic values
  # identity_ok checks that TotSS ≈ WSS + BSS (with tiny numerical tolerance)
  tibble(
    k = k,
    tot_within = tot_within,
    totss = totss,
    betweenss = betweenss,
    identity_ok = abs(totss - (betweenss + tot_within)) < 1e-8
  )
})

# Display the full diagnostic table for all tested k values
print(debug_tbl)
## # A tibble: 9 × 5
##       k tot_within   totss betweenss identity_ok
##   <int>      <dbl>   <dbl>     <dbl> <lgl>      
## 1     2    128933. 216160.    87227. TRUE       
## 2     3    109560. 216160.   106600. TRUE       
## 3     4     85203. 216160.   130957. TRUE       
## 4     5     61469. 216160.   154691. TRUE       
## 5     6     55144. 216160.   161016. TRUE       
## 6     7     48475. 216160.   167685. TRUE       
## 7     8     46292. 216160.   169868. TRUE       
## 8     9     42756. 216160.   173404. TRUE       
## 9    10     40916. 216160.   175244. TRUE

Picking Best k by Elbow and Silhouette Methods

The Elbow Method is based on the idea of diminishing returns in clustering quality. As the number of clusters k increases, the total within-cluster sum of squares (WSS) always decreases—since each point is closer to its cluster center. However, this improvement is not linear. Initially, adding clusters greatly reduces WSS, but after a certain point, the marginal gain becomes minimal. This point of inflection, where the curve “bends” like an elbow, indicates a suitable number of clusters that balances compactness with model simplicity.
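One simple way to make the “bend” quantitative is to look at the second differences of the WSS curve: the k with the largest positive second difference is a candidate elbow. A sketch with illustrative WSS values (not the values from our run):

```r
# Illustrative WSS values for k = 2..8 (not from the seeds analysis)
k_vals_demo <- 2:8
wss_demo <- c(500, 300, 200, 150, 130, 120, 115)

# Second differences measure how sharply the curve flattens
bend <- diff(diff(wss_demo))
k_elbow <- k_vals_demo[which.max(bend) + 1]   # here: k = 3
```

This heuristic is only a guide; visual inspection of the curve, as done below, remains standard practice.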

# Set up clustering input data and the range of k (number of clusters)
unlabeled_use <- unlabeled_scaled
k_vals <- 2:10
max_iter <- 100
start <- 10

# Compute WSS for each k using the custom K-means function
# This gives the total within-cluster sum of squares for Elbow plot
wss_values <- vapply(
  k_vals,
  function(k) compute_wss(k, unlabeled_use, max_iter, start),
  numeric(1)
)
# Create Elbow Plot (WSS vs. k)
plot_wss <- ggplot(data.frame(k = k_vals, wss = wss_values), aes(k, wss)) +
  geom_point() + geom_line() +
  labs(x = "Number of Clusters", y = "Total WCSS", title = "Elbow Method") +
  theme_minimal()

The Silhouette Method

The Silhouette Method, on the other hand, evaluates the cohesion and separation of clusters from a geometric perspective. For each data point, the silhouette score compares:

  • \(a(i)\): the average distance to other points in the same cluster (intra-cluster distance)

  • \(b(i)\): the average distance to points in the nearest neighbouring cluster (inter-cluster distance)

The silhouette score is then defined as:

\[s(i)=\frac{b(i)-a(i)}{\max \{a(i), b(i)\}}\]

Values close to 1 indicate well-separated and compact clusters, while values near 0 suggest overlapping clusters. The average silhouette width across all points serves as a global score of clustering quality. The k value that maximises this average silhouette score is often considered optimal.

Combining the Elbow and Silhouette methods allows us to cross-validate the choice of k, balancing interpretability (elbow) with structure quality (silhouette).
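For intuition, \(s(i)\) can be computed by hand for one point of a toy two-cluster dataset (illustrative numbers, not the seeds data):

```r
x <- c(1, 2, 10, 11)
clusters <- c(1, 1, 2, 2)

i <- 1  # the point x = 1, which sits in cluster 1
a_i <- mean(abs(x[clusters == 1 & seq_along(x) != i] - x[i]))  # within-cluster: |2 - 1| = 1
b_i <- mean(abs(x[clusters == 2] - x[i]))                      # nearest other cluster: mean(9, 10) = 9.5
s_i <- (b_i - a_i) / max(a_i, b_i)                             # (9.5 - 1) / 9.5, roughly 0.89
```

A value near 1 reflects that the point is much closer to its own cluster than to the neighbouring one.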

# Compute average silhouette width for each k
# Silhouette width measures how well points fit within their cluster vs. the next closest cluster
diss_matrix <- dist(unlabeled_use)  # Compute Euclidean distance matrix once
silhouette_values <- vapply(k_vals, function(k) {
  model <- f_kmeans(unlabeled_use, k, max_iter = max_iter, start = start)  # run clustering
  if (!is.null(model$clusters)) {
    sil <- silhouette(model$clusters, diss_matrix)  # silhouette() from the cluster package
    mean(sil[, 3])  # extract average silhouette width
  } else {
    NA_real_  # in case of error or no clusters
  }
}, numeric(1))

# Create Silhouette Plot (Avg. silhouette width vs. k)
plot_sil <- ggplot(data.frame(k = k_vals, sil = silhouette_values), aes(k, sil)) +
  geom_point() + geom_line() +
  labs(x = "Number of Clusters", y = "Average Silhouette Width", title = "Silhouette Method") +
  theme_minimal()

# Show the silhouette plot followed by the elbow plot
plot_sil
Elbow and Silhouette plots for determining optimal k

plot_wss
Elbow and Silhouette plots for determining optimal k

  • The silhouette method gives a quality score (higher is better).

  • The elbow method helps locate the “knee point” where WSS reduction starts to slow down.

  • Running both provides a cross-validation of the optimal number of clusters.

Elbow Method – Interpretation

The total within-cluster sum of squares (WSS) decreases as the number of clusters (k) increases, which is expected. The first “elbow” appears around k = 5, with a second, weaker bend at k = 7: before k = 5 the reduction in WSS is sharp, while after k = 5 the gains in compactness diminish, indicating diminishing returns. This suggests that 5 (or possibly 7) clusters offer a good balance between underfitting and overfitting.

Silhouette Method – Interpretation

The average silhouette width is highest at k = 2, with a value close to 0.4. However, the silhouette score drops significantly at k = 3, then peaks again slightly at k = 5, before steadily declining. A local maximum at k = 5 suggests that this value produces relatively well-separated and compact clusters compared to neighbouring options. Although k = 2 achieves the highest silhouette score, it may be too coarse (only 2 clusters). The local peak at k = 5 offers a more meaningful trade-off between separation and granularity.

We therefore select k = 5 as the final number of clusters.

Aligning Unsupervised Clusters with Annotated Classes

# ══════════════════════════════════════════════════════════════
# Assign inferred class labels to each cluster by comparing centroid proximity
# ══════════════════════════════════════════════════════════════

# Step 0: Run K-means clustering with k = 5 (determined earlier from Elbow/Silhouette)
res_final <- f_kmeans(unlabeled_scaled, k = 5, max_iter = 100)

# ──────────────────────────────────────────────────────────────
# Step 1: Compute cluster centroids in the scaled unlabeled dataset
cluster_centroids <- unlabeled_scaled %>%
  as.data.frame() %>%
  mutate(cluster = res_final$clusters) %>%  # Assign cluster labels
  group_by(cluster) %>%
  summarise(across(everything(), mean), .groups = "drop")  # Mean per cluster

# ──────────────────────────────────────────────────────────────
# Step 2: Compute centroids of known classes in scaled annotated data
annotated_scaled <- annotated %>%
  mutate(Class = as.factor(Class)) %>%
  select(-Class) %>%
  scale() %>%
  as.data.frame()

# Bind class labels back after scaling
annotated_centroids <- annotated_scaled %>%
  mutate(Class = annotated$Class) %>%
  group_by(Class) %>%
  summarise(across(everything(), mean), .groups = "drop")

# ──────────────────────────────────────────────────────────────
# Step 3: Calculate Euclidean distances between cluster and class centroids
cluster_matrix <- as.matrix(select(cluster_centroids, -cluster))
class_matrix   <- as.matrix(select(annotated_centroids, -Class))
distance_matrix <- as.matrix(dist(rbind(cluster_matrix, class_matrix)))

# Extract k × m distance submatrix: rows = clusters, columns = classes
k <- nrow(cluster_matrix)
m <- nrow(class_matrix)
prox_matrix <- distance_matrix[1:k, (k + 1):(k + m)]

# ──────────────────────────────────────────────────────────────
# Step 4: Assign each cluster to the closest known class
closest_class_index <- apply(prox_matrix, 1, which.min)  # for each cluster, find closest class
closest_class_labels <- annotated_centroids$Class[closest_class_index]

# Create a named vector: cluster ID → class label
cluster_to_label <- setNames(as.character(closest_class_labels), cluster_centroids$cluster)

# Add assigned class labels to each unlabeled sample (based on its cluster)
unlabeled_use <- unlabeled_scaled %>%
  as.data.frame() %>%
  mutate(cluster = res_final$clusters) %>%
  mutate(label = cluster_to_label[as.character(cluster)])

This chunk aligns unsupervised cluster assignments with known class labels by comparing centroids in the standardised feature space. Each cluster is matched to the closest annotated class using Euclidean distance between centroids, enabling interpretable labeling of previously unlabeled data. This provides a bridge between unsupervised clustering and semi-supervised classification.
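The matching step can be seen in isolation on made-up centroids (2 clusters, 2 classes, 2 features; all values hypothetical):

```r
# Hypothetical centroid coordinates in the standardised feature space
cluster_centroids <- matrix(c(0, 0,
                              5, 5), nrow = 2, byrow = TRUE)
class_centroids   <- matrix(c(5.2,  4.9,
                              0.1, -0.1), nrow = 2, byrow = TRUE)
class_labels <- c("A", "B")

# Pairwise Euclidean distances: rows = clusters, columns = classes
d <- as.matrix(dist(rbind(cluster_centroids, class_centroids)))[1:2, 3:4]

# Each cluster inherits the label of its nearest class centroid
assigned <- class_labels[apply(d, 1, which.min)]   # "B", "A"
```

Note that this greedy matching can assign the same class to several clusters; with k = 5 clusters and 7 annotated classes that is acceptable here, but a strict one-to-one assignment would require, e.g., the Hungarian algorithm.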

Visualisation of Clusters by PCA and HCPC

This part outlines the implementation of Principal Component Analysis (PCA) on the standardised feature set of the unlabelled dataset to reduce dimensionality and visualise clustering results. The first three principal components (PC1, PC2, PC3) are extracted and plotted in a 3D scatter plot, with each data point coloured according to its inferred class label.

# Perform Principal Component Analysis (PCA) on the scaled unlabeled dataset
pca_result <- PCA(unlabeled_scaled, graph = FALSE)

# Extract coordinates of individuals on the first 3 principal components
pca_coords <- as.data.frame(pca_result$ind$coord[, 1:3])
colnames(pca_coords) <- c("PC1", "PC2", "PC3")

# Assign predicted class labels to each observation for colouring
pca_coords$Class <- unlabeled_use$label

# Define a discrete colour palette using RColorBrewer
pal <- brewer.pal(max(3, length(unique(pca_coords$Class))), "Set2")

# Create an interactive 3D scatter plot of the first three principal components
plot_ly(data = pca_coords,
        x = ~PC1, y = ~PC2, z = ~PC3,
        color = ~Class, colors = pal,
        type = "scatter3d", mode = "markers",
        marker = list(size = 3)) |>
  layout(title = "PCA – Colored by Assigned Class Labels",
         legend = list(title = list(text = "Class")),
         scene = list(
           xaxis = list(title = "PC1"),
           yaxis = list(title = "PC2"),
           zaxis = list(title = "PC3")
         ))
# Perform HCPC (Hierarchical Clustering on Principal Components) and 3D visualisation
if (!exists("res.pca") || !exists("res.hcpc")) {
  hcpc_data <- scale(unlabeled)                 # Standardise the original unlabeled data
  res.pca <- PCA(hcpc_data, graph = FALSE)      # Run PCA on scaled data (no graph output)
  res.hcpc <- HCPC(res.pca, graph = FALSE)      # Run HCPC on PCA results (no graph output)
}

# Extract coordinates of the first three principal components
pca_coords <- as.data.frame(res.pca$ind$coord[, 1:3])
colnames(pca_coords) <- c("PC1", "PC2", "PC3")

# Add cluster information (from earlier k-means) for colouring
pca_coords$cluster <- factor(unlabeled_use$cluster)

# Create colour palette
pal <- brewer.pal(max(3, length(levels(pca_coords$cluster))), "Set2")

# Interactive plot for HTML output
if (knitr::is_html_output()) {
  plot_ly(
    data = pca_coords,
    x = ~PC1, y = ~PC2, z = ~PC3,
    type = "scatter3d", mode = "markers",
    color = ~cluster, colors = pal,
    marker = list(size = 4)
  ) |>
    layout(
      scene = list(
        xaxis = list(title = "PC1"),
        yaxis = list(title = "PC2"),
        zaxis = list(title = "PC3")
      ),
      legend = list(title = list(text = "Cluster"))
    )
} else {
  # Fallback 2D plot for non-HTML output (e.g. PDF)
  fviz_cluster(
    res.hcpc,
    geom = "point", repel = TRUE,
    show.clust.cent = TRUE,
    axes = c(1, 2),
    palette = pal
  )
}

This code block performs Hierarchical Clustering on Principal Components (HCPC) using the FactoMineR and factoextra packages.

The K-means algorithm produced 5 clusters based on the elbow and silhouette criteria, whereas HCPC (Hierarchical Clustering on Principal Components) automatically selected 3 clusters. This discrepancy highlights the methodological difference: K-means relies on pre-specified k and optimises intra-cluster compactness, while HCPC infers cluster number from hierarchical structure, aiming to capture global data separation. These complementary results provide insights at different levels of granularity.

pca_coords2 <- as.data.frame(pca_result$ind$coord[, 1:3])
colnames(pca_coords2) <- c("PC1", "PC2", "PC3")
pca_coords2$label <- unlabeled_use$label  # labels assigned earlier by centroid proximity

pal2 <- brewer.pal(max(3, length(unique(pca_coords2$label))), "Set2")

plot_ly(data = pca_coords2, 
        x = ~PC1, y = ~PC2, z = ~PC3,
        color = ~label, colors = pal2,
        type = "scatter3d", mode = "markers",
        marker = list(size = 3)) |>
  layout(title = "PCA - Colored by Class Label (Centroid Proximity)",
         legend = list(title = list(text = "Class")),
         scene = list(
           xaxis = list(title = "PC1"),
           yaxis = list(title = "PC2"),
           zaxis = list(title = "PC3")
         ))

Classification of Annotated Data

svmRadial Method

# Ensure Class is a factor
annotated$Class <- as.factor(annotated$Class)

# Remove class temporarily, scale, then add back
ann <- select(annotated, -Class) |> scale() |> as_tibble()
ann$Class <- annotated$Class

# Create training/test split
set.seed(82171165)
idx <- createDataPartition(ann$Class, p = 0.7, list = FALSE)
train <- ann[idx, ]
test  <- ann[-idx, ]

# Train classifier
model <- train(Class ~ ., data = train, method = "svmRadial")

# Predict and evaluate
pred <- predict(model, test)
confusionMatrix(pred, test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction A B C D E F G
##          A 1 0 0 0 0 0 0
##          B 0 1 0 0 0 0 0
##          C 0 0 4 0 1 0 0
##          D 0 0 0 6 0 0 0
##          E 0 0 0 0 1 0 0
##          F 0 0 0 0 0 6 0
##          G 1 0 0 0 0 0 6
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9259          
##                  95% CI : (0.7571, 0.9909)
##     No Information Rate : 0.2222          
##     P-Value [Acc > NIR] : 1.014e-14       
##                                           
##                   Kappa : 0.9085          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity           0.50000  1.00000   1.0000   1.0000  0.50000   1.0000
## Specificity           1.00000  1.00000   0.9565   1.0000  1.00000   1.0000
## Pos Pred Value        1.00000  1.00000   0.8000   1.0000  1.00000   1.0000
## Neg Pred Value        0.96154  1.00000   1.0000   1.0000  0.96154   1.0000
## Prevalence            0.07407  0.03704   0.1481   0.2222  0.07407   0.2222
## Detection Rate        0.03704  0.03704   0.1481   0.2222  0.03704   0.2222
## Detection Prevalence  0.03704  0.03704   0.1852   0.2222  0.03704   0.2222
## Balanced Accuracy     0.75000  1.00000   0.9783   1.0000  0.75000   1.0000
##                      Class: G
## Sensitivity            1.0000
## Specificity            0.9524
## Pos Pred Value         0.8571
## Neg Pred Value         1.0000
## Prevalence             0.2222
## Detection Rate         0.2222
## Detection Prevalence   0.2593
## Balanced Accuracy      0.9762

Accuracy: 92.59%. The model correctly classified roughly 93% of the test samples, and the 95% confidence interval of (0.76, 0.99) suggests the estimate is robust despite the small test set.

Classes C and G: perfect sensitivity (1.0) but slightly lower precision, meaning the model detects every sample of these classes while producing a few false positives.

Class A: only one of the two Class A samples was correctly classified, giving low recall but high precision.

The model performs very well overall, especially considering the small sample size and class imbalance. It maintains perfect recall (1.0) for 5 out of 7 classes. Misclassifications are minimal and generally involve classes with few examples (e.g., A and E), which is expected in low-data scenarios. The classifier is suitable for practical seed classification and generalises well to unseen data.